>> This
lecture is about topic mining and
analysis.
We're going to talk about its
motivation and task definition.
In this lecture we're going to talk
about a different kind of mining task.
As you see on this road map,
we have just covered
mining knowledge about language,
namely discovery of
word associations such as paradigmatic
relations and syntagmatic relations.
Now, starting from this lecture, we're
going to talk about mining another kind of
knowledge, which is content mining, and
trying to discover knowledge about
the main topics in the text.
And we call that topic mining and
analysis.
In this lecture, we're going to talk about
its motivation and the task definition.
So first of all,
let's look at the concept of topic.
So a topic is something that we
all understand, I think, but
it's actually not that
easy to formally define.
Roughly speaking, a topic is the main
idea discussed in text data.
And you can think of this as a theme or
subject of a discussion or conversation.
It can also have different granularities.
For example,
we can talk about the topic of a sentence,
the topic of an article,
the topic of a paragraph, or
the topic of all the research articles
in a research library, right?
So different granularities of topics
obviously have different applications.
Indeed, there are many applications that
require discovering topics in text and
analyzing them.
Here are some examples.
For example, we might be interested
in knowing what Twitter
users are talking about today.
Are they talking about NBA sports, or
are they talking about some
international events, etc.?
Or we are interested in
knowing about research topics.
For example, one might be interested in
knowing what are the current research
topics in data mining, and how are they
different from those five years ago?
Now this involves discovery of topics
in the data mining literature, and
we also want to discover topics in
today's literature and those in the past.
And then we can make a comparison.
We might also be interested in
knowing what do people like about
some products like the iPhone 6,
and what do they dislike?
And this involves discovering
topics in positive opinions about
iPhone 6 and
also negative reviews about it.
Or perhaps we're interested in knowing
what are the major topics debated in the 2012
presidential election?
And all these have to do with discovering
topics in text and analyzing them,
and we're going to talk about a lot
of techniques for doing this.
In general we can view a topic as
some knowledge about the world.
So from text data we expect to
discover a number of topics, and
then these topics generally provide
a description about the world.
And they tell us something about the world,
about a product, about a person, etc.
Now when we have some non-text data,
then we can have more context for
analyzing the topics.
For example, we might know the time
associated with the text data, or
locations where the text
data were produced,
or the authors of the text, or
the sources of the text, etc.
All such metadata, or
context variables, can be associated
with the topics that we discover, and
then we can use these context variables
to help us analyze patterns of topics.
For example, looking at topics over time,
we would be able to discover
whether there's a trending topic, or
some topics might be fading away.
Similarly, looking at topics
in different locations,
we might gain some insights about
people's opinions in different locations.
So that's why mining
topics is very important.
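To make the idea of looking at topics over time a bit more concrete,
here is a minimal sketch in Python. It assumes that each document
already comes with a time period and a list of per-topic weights;
how such weights can be obtained is not covered here. The function
name average_coverage_by_period and all the numbers are made up
purely for illustration.

```python
from collections import defaultdict


def average_coverage_by_period(records):
    """Average the per-topic weights of documents within each period.

    records: iterable of (period, weights) pairs, where weights is a
    list with one number per topic (hypothetical input format).
    """
    sums = {}
    counts = defaultdict(int)
    for period, weights in records:
        if period not in sums:
            sums[period] = [0.0] * len(weights)
        for j, w in enumerate(weights):
            sums[period][j] += w
        counts[period] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


# Made-up data: two topics, documents from two years.
records = [
    (2011, [0.75, 0.25]),
    (2011, [0.25, 0.75]),
    (2012, [0.25, 0.75]),
    (2012, [0.25, 0.75]),
]
trend = average_coverage_by_period(records)
for period in sorted(trend):
    print(period, trend[period])
```

In this toy data the average weight of the second topic grows from
one year to the next, which is the kind of pattern we would read as
a trending topic. The same grouping could be done by location or by
author instead of by time.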
Now, let's look at the tasks
of topic mining and analysis.
In general, it would involve first
discovering a lot of topics, in this case,
k topics.
And then we also would like to know, which
topics are covered in which documents,
to what extent.
So for example, in document one, we
might see that Topic 1 is covered a lot,
Topic 2 and
Topic k are covered with a small portion.
And other topics,
perhaps, are not covered.
Document two, on the other hand,
covers Topic 2 very well,
but it does not cover Topic 1 at all, and
it also covers Topic k to some extent,
etc., right?
So now you can see there
are generally two different tasks, or
sub-tasks. The first is to discover k
topics from a collection of text data.
What are these k topics,
the k major topics in the text data?
The second task is to figure out
which documents cover which topics
to what extent.
So more formally,
we can define the problem as follows.
First, we have, as input,
a collection of N text documents.
Here we can denote the text
collection as C, and
denote a text article as d sub i.
And, we generally also need to have
as input the number of topics, k.
There may be techniques that can
automatically suggest the number of topics.
But in the techniques that we will
discuss, which are also the most useful
techniques, we often need to
specify the number of topics.
Now the output would then be the k
topics that we would like to discover,
denoted as theta sub
one through theta sub k.
Also we want to generate the coverage of
topics in each document d sub i, and
this is denoted by pi sub ij.
And pi sub ij is the probability
of document d sub i
covering topic theta sub j.
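Written out in symbols, the input and output described so far can
be summarized roughly as follows. This only restates the spoken
definition; the layout is one possible way to write it down.

```latex
\begin{aligned}
\text{Input: }  & C = \{d_1, \ldots, d_N\}, \quad k \\
\text{Output: } & \{\theta_1, \ldots, \theta_k\}, \quad
                  \{\pi_{i1}, \ldots, \pi_{ik}\} \text{ for each } d_i \\
                & \pi_{ij} = \text{probability of } d_i \text{ covering } \theta_j
\end{aligned}
```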
So obviously for each document, we have
a set of such values to indicate to
what extent the document covers
each topic.
And we can assume that these
probabilities sum to one,
because a document won't be able to cover
other topics outside of the topics
that we discovered.
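As a small illustration of what this output looks like as data,
here is a minimal sketch in Python. The topics are just placeholder
labels, since theta itself has not been defined yet, and the
coverage numbers are made up to mirror the two example documents
described earlier. The helper check_coverage is a hypothetical name;
it only checks the constraint that each document's coverage values
are non-negative and sum to one.

```python
import math


def check_coverage(coverage, k, tol=1e-9):
    """Return True if every document has k non-negative values summing to one."""
    for pi_i in coverage:
        if len(pi_i) != k:
            return False
        if any(p < 0 for p in pi_i):
            return False
        if not math.isclose(sum(pi_i), 1.0, abs_tol=tol):
            return False
    return True


# Input: a collection C of N documents and the number of topics k.
C = ["text of document 1", "text of document 2"]  # N = 2, placeholder texts
k = 3

# Output, part 1: k topics (placeholder labels only).
topics = ["theta_1", "theta_2", "theta_3"]

# Output, part 2: coverage[i][j] plays the role of pi sub ij.
# Illustrative numbers, not results from any real data.
coverage = [
    [0.8, 0.1, 0.1],  # document 1: mostly topic 1
    [0.0, 0.7, 0.3],  # document 2: no topic 1, mostly topic 2
]

print(check_coverage(coverage, k))
```

The check returns False for a row such as 0.5, 0.2, 0.1, because
that document would have part of its content not accounted for by
any of the k topics.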
So now, the question is, how do we define
theta sub j, how do we define the topic?
Now this problem is not
completely defined
until we define exactly what theta is.
So in the next few lectures,
we're going to talk about
different ways to define theta.

